ABSTRACT



Aspirations for the Chicago School Reform Act of 1988 and 
more recent accountability efforts for the central office indicate that the 
Chicago Public School (CPS) system needs a credible system for charting 
academic improvement. The annual systemwide reports of student test scores 
are crude and sometimes seriously biased indicators for making judgments 
about the productivity of individual schools. This report uses Iowa Tests of 
Basic Skills (ITBS) scores for all students in grades 2 through 8 from 1987 
to 1996, data that represent 5- or 6-year trends, depending on the school,
for student learning under reform. The report, which initiates the "Examining 
Productivity" series, details a series of weaknesses in the current CPS 
testing and reporting system, and develops an alternative approach, a school 
academic productivity profile, to summarize the changes that have occurred in
a school. The core of this approach entails estimating the value that a 
school adds to the learning of students taught in this school. In this 
initial report, the productivity profile is developed for each school and 
used to summarize trends in reading and mathematics achievement. Subsequent 
reports will use the same data to examine the performance of schools that 
have been especially effective. The new approach creates a new test score 
metric that allows researchers to take into account the different content 
used in the various ITBS forms to better compare results across time. 
Content-referenced scales for reading and mathematics are developed. The 
productivity profile is built of two basic pieces of information for each 
grade: the input status for that grade and the learning gain recorded for 
each grade. This reflects the value added to the learning of the school’s 
students. Some specific recommendations are made to continue the development 
of the new testing and reporting system. These are: (1) alignment with CPS
learning goals; (2) score reporting on a content-referenced scale; (3) a
stable measurement ruler for assessing academic progress; (4) an 
accountability focus on the school's value added to student learning; and (5) 
an inclusive orientation. An appendix discusses estimating trends in school 
productivity. Attachments include the reading and mathematics rulers. 

(Contains 23 figures and 17 references.) (SLD) 
Introduction

The past decade saw two major changes in the governance and operations 
of the Chicago Public Schools (CPS). The Chicago School Reform Act of 
1988 devolved substantial resources and authority to local schools and made 
them responsible for their own improvement. This law established locally 
elected school councils with authority to evaluate and select the school prin- 
cipal, and devise an annual School Improvement Plan and budget. Increased 
discretionary monies, provided as part of this legislation, have fueled local 
improvement efforts including hiring additional staff; purchasing instructional
materials, equipment, and textbooks; and increasing professional
development activities.

Beginning in 1991, the Consortium on Chicago School Research initi- 
ated a number of critical probes of Chicago’s decentralization reform. Our 
early work focused primarily on how teachers and principals in elementary 
schools reacted to this reform, how they used the opportunities it provided 
for local improvement initiatives, and the constraints they encountered in 
advancing school change. Over the last three years, we brought more in- 
tense scrutiny to reform of the city’s high schools. In both cases, we adopted 
a strong formative orientation seeking to assist both school community lead- 
ers and systemwide policy makers. We have sought to chart the progress of 
this reform and to advance the public conversation about additional changes 
needed if this reform is to culminate in major improvements in educational 
opportunities for children. 

More recent state legislation in 1995 added a new dimension to reform — 
it restructured the central office. The legislation created a corporate style 
management team, including a chief executive officer, who replaced the 
position of superintendent, and a Reform Board of Trustees, who are now 
directly appointed by the mayor. This law brought greater central account- 
ability by clarifying the powers of the chief executive officer to deal with 
non-improving schools. As the system has moved aggressively to use these 
new powers to place over 100 schools on probation and to reconstitute 
some of the most problematic among them, the need to accurately identify 
failing schools has become more critical. To date, the system has relied
primarily on a simple statistical indicator (fewer than 15 percent of
students scoring above national norms on the Iowa Tests of Basic Skills, or ITBS) for
this purpose. While the CPS’s efforts to intervene in failing schools have 
been generally lauded, criticisms have been raised about the specific cri- 
terion used. 

Purpose of this Paper 

Looking back to 1988, it is very clear that the Chicago Public Schools 
needed deep and profound changes. While there were a few pockets of 
excellence, taken in total it was a school system organized for failure. 
The 1988 Chicago School Reform Act banked on expanded local par- 
ticipation to challenge this dysfunctional status quo and to promote 
structural change at both the individual school and the system level. 
While reformers recognized that major changes in student learning might 
not come quickly, the ultimate bottom line for reform was improve- 
ments in academic achievement. 

Thus, the aspirations for the 1988 Reform Act as well as the more recent 
accountability efforts of the central office indicate that the CPS needs a 
credible system for charting academic improvement. As we demonstrate 
below, the annual systemwide reports of student test scores, while of great 
public interest, are crude and sometimes seriously biased indicators for 
making judgments about the productivity of individual schools. For this 
reason, several Consortium staff and affiliates, under the initiative of the 
Chicago Panel on School Policy, have been working for a number of years 
on better ways to analyze and report standardized test score data for exam- 
ining the academic productivity of the Chicago Public Schools. 

This report uses ITBS scores for all students in grades two through eight 
from 1987 to 1996. In half of the schools, where local school councils had 
the opportunity to choose their own principal in 1990, these data represent 
six-year trends in student learning under reform. For the other half of schools,
which had the opportunity to select a principal in 1991, these data represent
five-year trends. In both cases, sufficient time has been afforded for signifi- 
cant organizational changes to occur. A body of evidence has finally been 
assembled that makes it now possible to seriously investigate time trends in
school academic productivity. 

This report differs from others distributed by the Consortium in that it 
is more expository in tone and somewhat more technical. We detail a set of 
weaknesses in the current CPS testing and reporting system, and develop 
an alternative approach, called a school academic productivity profile, for
summarizing the changes that have occurred in a school. The core of this
approach entails estimating the value that a school adds to the learning of
students taught at that school. 

This report also initiates our “Examining Productivity” series. It is the 
first in a series of studies that will systematically examine the academic pro- 
ductivity of Chicago’s public elementary schools. This report develops a 
productivity profile for each school and uses these to summarize the sys- 
temwide trends over the decade from 1987 to 1996 in reading and math- 
ematics achievement. Subsequent reports will use these same data to inves- 
tigate the characteristics of schools that have been especially effective in 
their academic improvement efforts. 

The term academic productivity has a very specific meaning in the con- 
text of this report series. It refers to the contribution a school (or group of 
schools) makes to the learning of students receiving instruction in that school. 
Improving academic productivity means that the contribution to students’ 
learning is increasing over time. We detail later in the report that this is the 
most appropriate standpoint for school accountability. To be clear, improv- 
ing academic productivity does not necessarily mean high test scores. If a 
school enrolls a large proportion of weak students, the school may contrib- 
ute a great deal to their learning, but overall test scores may still be rather 
low because of the limited preparation that these students bring to the school.
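
The distinction between high scores and high productivity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `productivity_profile`, the record layout, and all numbers are hypothetical, and "input status" is assumed here to be the mean prior-year score of the students in a grade, with "learning gain" the mean of their year-to-year score differences.

```python
from collections import defaultdict
from statistics import mean

def productivity_profile(records):
    """Summarize one school-year (hypothetical helper, not the report's code).

    records: iterable of (grade, prior_year_score, current_year_score).
    Returns {grade: (input_status, learning_gain)}.
    """
    by_grade = defaultdict(list)
    for grade, prior, current in records:
        by_grade[grade].append((prior, current))
    profile = {}
    for grade, pairs in sorted(by_grade.items()):
        input_status = mean([prior for prior, _ in pairs])           # where students start
        learning_gain = mean([cur - prior for prior, cur in pairs])  # what the year added
        profile[grade] = (input_status, learning_gain)
    return profile

# Invented fourth-grade records for two schools.
weak_intake = [(4, 30, 42), (4, 34, 44), (4, 32, 45)]
strong_intake = [(4, 50, 55), (4, 54, 58), (4, 52, 57)]

weak_profile = productivity_profile(weak_intake)
strong_profile = productivity_profile(strong_intake)
weak_final = mean([cur for _, _, cur in weak_intake])
strong_final = mean([cur for _, _, cur in strong_intake])
```

In this invented example the first school ends the year with lower scores but adds more to its students' learning, which is the situation the paragraph above describes.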

Before forging ahead, an important caveat is in order. The analyses
presented here, and in subsequent reports, are the best we can offer 
given the limitations of the available data. We emphasize at the outset 
that these data limitations are considerable. This report concludes that 
the CPS needs a better testing and reporting system in order to have a 
more accurate basis in the future for charting academic productivity. 
The Consortium’s Steering Committee offers a number of recommen- 
dations to frame these future developments. 

A Weak Indicator: Problems with Percentage of Students at National Norms

Different statistical indicators are needed for different purposes. An in- 
dicator that is useful to describe student achievement across the whole 
system may not necessarily be well suited for examining individual school 
productivity and improvements (or declines) in that productivity. The 
Chicago Public Schools have used a variety of statistics over the years 
for reporting student achievement. These include median grade equiva- 
lent scores, median percentile ranks, and “the percent of students scor- 
ing at or above national norms.” Recently, this latter statistic has been 
used to make important decisions about individual schools, including
whether they are put on academic remediation or probation. 

The percent of students scoring at or above national norms was first
calculated in response to the 1988 Reform Act, which mandated a goal for
each school of academic achievement “that equals or surpasses national
norms.” While this statistic does indicate a very real systemwide gap from
the national norm, it can be problematic when used to judge changes in the
performance of an individual school. The major concern is that this statistic
is responsive to changes in the performance of only a subgroup of students:
those who cluster close to national norms. Significant improvements in the
learning of very low achieving students, for example, in the 10-30 percentile
range, can go undetected. This is problematic since many Chicago schools
enroll large numbers of such students. We note that the same issue arises for
improvements among higher achieving students, for example, in the 70-90
percentile range. These changes also would go unrecognized.

Figure 1a. Initial Distribution of Student Scores
Figure 1b. Distribution of Student Scores after a Broad-Based Intervention
Figure 1c. Distribution of Student Scores after a Narrow-Based Intervention

We demonstrate the problem with a simple illustration. Figure 1a presents
a profile of test scores for a low achieving elementary school with
students arrayed within 10-point percentile ranges called deciles. Let’s con- 
sider two possible school improvement scenarios. In the first case, a broad- 
based intervention is put in place that affects the achievement of all stu- 
dents, with more attention, however, focusing on the lowest achieving 
students. As a consequence, students who were originally in the lowest three 
deciles moved up about 20 percentile points. All other students improved 
by 10 percentile points. (See Figure 1b.)

In the second case, a much narrower intervention was attempted focus- 
ing only on students in the fourth and fifth deciles (i.e., the 30-49 percen- 
tile range). While this intervention was successful in moving many of these 
students toward or above the threshold of national norms, the vast majority 
of students in the school remain unaffected. (See Figure 1c.)
In terms of the indicator of percentage of students at or above national
norms, these two cases are indistinguishable — both improved from 10 to 
15 percent! Although these two interventions are very different in terms of 
their consequences for students, the principal criterion currently used by 
the CPS for accountability purposes would not distinguish between them.

Other statistical indicators can do a better job in this regard. The median
percentile for the school is a somewhat better statistic because it clearly
points out the large improvement in the first case (Figure 1b) from the 20th
to the 36th percentile. This statistic, however, does not detect the small
improvement that did occur in the second case. An even better statistic
for this purpose is the school mean achievement (i.e., the simple average
of all students’ test scores). 1 It correctly detects both the large
improvement in the first case and the small improvement in the second
case. This occurs because the school mean achievement indicator is
sensitive to the performance of all students. Any changes, even small
ones, will be reflected here. We build on this idea in a subsequent section
when we introduce a value-added indicator of school productivity.
This indicator, which assesses the contributions that a school makes to
students’ learning, is based on the mean learning gains for all children
receiving instruction in a school in a given year. Here, too, the performance
of each individual student affects the final results. 2
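
The contrast among the three indicators can be reproduced with a small Python sketch. It uses a hypothetical school of 100 students whose scores sit at decile midpoints; the counts are invented for illustration and are not the data behind Figure 1, so the percentages differ from those quoted above (here both scenarios move from 10 to 20 percent at or above the norm). The function names are likewise assumptions.

```python
from statistics import mean, median

NORM = 50  # national norm treated as the 50th percentile

# Hypothetical low achieving school: scores placed at decile midpoints.
BASELINE = ([5] * 25 + [15] * 25 + [25] * 15 + [35] * 15 +
            [45] * 10 + [55] * 5 + [65] * 3 + [75] * 2)

def pct_at_or_above_norm(scores, norm=NORM):
    """Percent of students scoring at or above the national norm."""
    return 100 * sum(s >= norm for s in scores) / len(scores)

def broad_intervention(scores):
    """Lowest three deciles gain 20 points; everyone else gains 10."""
    return [s + 20 if s < 30 else s + 10 for s in scores]

def narrow_intervention(scores):
    """Only students in the fourth and fifth deciles (30-49) gain 10 points."""
    return [s + 10 if 30 <= s < 50 else s for s in scores]

def summarize(scores):
    return {
        "pct_at_norm": pct_at_or_above_norm(scores),
        "median": median(scores),
        "mean": mean(scores),
    }

before = summarize(BASELINE)
broad = summarize(broad_intervention(BASELINE))
narrow = summarize(narrow_intervention(BASELINE))
```

With these invented counts, the percent-at-norm indicator is identical for the two interventions, the median moves only under the broad intervention, and the mean registers both changes, with a much larger shift for the broad one.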

Need for a Stable Measurement Ruler over Time:
Problems Associated with Nationally-Normed Standardized Tests

The ITBS is the main achievement data gathered annually by the Chi- 
cago Public Schools and is the sole information source currently used 
by the system for school accountability purposes. These tests are inex- 
pensive and relatively easy to administer and score. They are quite use- 
ful for the purposes for which they were originally intended — to tell us 
about how well our students perform against a national sample of students
who took the same test. They were not, however, specifically designed
for the purpose to which we now put them: assessing improvements
in schools’ productivity over time. 

By way of background, the ITBS is not a single test, but rather a 
testing system. It consists of multiple forms that were developed at dif- 
ferent points in time. These forms are literally different tests with no 
overlapping items. Each form consists of multiple levels, each designed 
to be administered to students at a particular grade. For example, level 9 
is designed for grade 3, level 10 for grade 4, and so on. Although it is 
now an infrequent practice in the CPS, students sometimes have been
tested “off level,” such as giving level 8 to a very disadvantaged third 
grader or level 10 to a gifted student at the same grade. 

The Non-Equivalence of Grade Equivalence 

The ITBS, like most nationally norm-referenced standardized tests, pro- 
duces a score report called a grade equivalent (GE). GEs have a great deal of 
appeal to teachers and parents because they appear to describe a child’s 
performance in developmental terms of grade level and months within grades. 
Since the CPS administers the ITBS in the eighth month of the school year, 
a fourth grader’s score of 4.8 is “on grade,” “at grade level,” or “at the na- 
tional norm.” Similarly, a fifth grader who tested at grade level is assigned a 
GE of 5.8, a sixth grader who is on grade scores a 6.8, and so on. 

Since all of the test forms and levels produce GEs, the lay user might 
easily think that these results are equivalent and directly comparable. In 
fact, this is not true. To demonstrate the problems here, we gave a sample of 
CPS students two different reading and math tests from the ITBS series. In 
one case, we administered adjacent levels (8 and 9) from CPS91 (Form G), 
which was administered systemwide in 1991. In a second case, we administered
the same level of the test (level 9) but from two different forms, CPS90
(Form J) and CPS91 (Form G), which were used in 1990 and 1991. Fi- 
nally, in the third case we changed both the level and form. A sample of 
students took both test level 8 of CPS90 and level 9 of CPS91. 3 The latter 
case is interesting because it directly represents what CPS students actually 
experience. That is, as students progress across the grades, they normally 
change test levels each year. In addition, since 1990, the CPS has been 
changing the form of the test administered each year. Thus, as we consider 
the year-to-year progress of students over time, we are actually comparing 
data from two different forms and levels. 

A basic criterion for comparing data from any testing system is that stu- 
dents’ score reports should not depend upon the particular form or level of 
the test taken so long as it is appropriate for their general ability. Thus, if we 
give a child two different tests, we expect similar estimates of that child’s 
competence. While some children might do a bit better on the first test, 
and others might do somewhat better on the second, on average the two 
tests should tell us the same thing. 4 Figures 2a, 2b and 2c demonstrate, 
however, that this is not always the case with grade equivalent scores from 
the ITBS reading assessments. For example, students who were given CPS91 
(see Figure 2a) were more than twice as likely to have better GE scores on 
the higher level test (level 9) than on the lower level test (level 8). Similarly,
consider the students who took the same level of the test from two different
years (Figure 2b). These students were much more likely to do better on
CPS90 than on CPS91. These differential score effects are equally dramatic
when we consider the comparison across forms and levels (Figure 2c). Students
were seven times more likely to score higher on CPS91, level 9 than
on CPS90, level 8.

Figure 2. GE Test Score Bias Due to Form and Level Differences
Test pairs: (a) 1991 test, level 9 vs. 1991 test, level 8; (b) 1991 test, level 9 vs. 1990 test, level 9; (c) 1991 test, level 9 vs. 1990 test, level 8.
Note 1: The “about the same” category is +/- 1 standard deviation from zero.
Note 2: See endnote 8.
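
A paired comparison of this kind can be tallied as in the following sketch. The score pairs are invented, and the `tolerance` argument is an assumed stand-in for the "about the same" band (the source describes that band only as +/- 1 standard deviation from zero and does not give its value).

```python
def classify_differences(pairs, tolerance):
    """Count students who scored better, about the same, or worse on the second test.

    pairs: (score_on_first_test, score_on_second_test) for each student.
    tolerance: half-width of the "about the same" band (an assumed value).
    """
    counts = {"better": 0, "same": 0, "worse": 0}
    for first, second in pairs:
        diff = second - first
        if diff > tolerance:
            counts["better"] += 1
        elif diff < -tolerance:
            counts["worse"] += 1
        else:
            counts["same"] += 1
    return counts

def better_to_worse_ratio(counts):
    """How many times more likely a student was to do better than worse."""
    if counts["worse"] == 0:
        return float("inf")
    return counts["better"] / counts["worse"]

# Invented GE pairs for eight students who took two different tests.
ge_pairs = [(2.5, 3.4), (3.0, 3.1), (2.8, 3.6), (3.2, 2.6),
            (2.9, 3.9), (3.1, 3.0), (2.6, 3.3), (3.0, 3.8)]

counts = classify_differences(ge_pairs, tolerance=0.5)
ratio = better_to_worse_ratio(counts)
```

If the two tests were interchangeable, the "better" and "worse" counts would be roughly equal and the ratio would sit near 1; a ratio far from 1 signals the kind of form or level bias described above.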

These empirical examples illustrate a general problem: grade equivalents
are both form and level specific and cannot be strictly compared.
Clearly, this limits our ability to make accurate statements about how much
an individual student is actually learning over time. It also introduces
a great deal of uncertainty into any assessment of whether scores may be
going up or down over time for an individual school or across the whole
system. While real changes in student performance are embedded here, so
are the differences in the test scoring.

Figure 3 presents a clear example of the problems that this can produce
when we try to interpret grade equivalent scores to assess progress over time.
We illustrate the GE gains made by seventh grade students in “Millard
Fillmore Elementary School” in 1992, 1993, and 1994. 5 The seventh graders
in 1992 gained approximately 1.0 grade equivalents over their end of
grade six performances (see Figure 3a). The following year, seventh graders
gained 1.7 GEs, an improvement of 70 percent. In 1994, however, student
gains fell back to 0.7 GEs, worse than where they started two years
earlier! Why did this school suddenly lose the productivity improvement
from the year before? What went wrong?

Figure 3. Trends in Reading Gains: A School Effect or Measurement Artifact?

Note: Figure 3c uses a box plot to display the distribution of school gains. The area inside
the box represents gains for half of the schools; the top whisker represents 25 percent of
the schools with the greatest gains, and the bottom whisker 25 percent of the schools with
the lowest gains.

In fact, it is quite likely that nothing went wrong in 1994, and probably
nothing went right in 1993 either. This pattern of results is not distinctive
to Fillmore; it occurred generally across the entire school system. Figure 3c
presents a set of box plots that displays the seventh grade gains in these
same three years for all Chicago elementary schools. Notice that in most
schools, seventh grade gains went up in 1993 and then fell back down in
1994. The median CPS elementary school went from 1.0 GE gain in 1992 
to a 1.5 GE gain in 1993 and then back down to 0.9 GE gain in 1994. 
While Fillmore students gained a bit more in 1993 and lost a bit more in 
1994, their results closely follow the overall system trend. 
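
One rough way to see how much of Fillmore's swing is shared with the whole system is to subtract the systemwide median gain from the school's gain in each year. The sketch below uses the approximate GE gains quoted in the text; the subtraction itself is only an illustration and is not the method developed in this report.

```python
# Approximate seventh grade reading gains (in GEs) quoted in the text.
fillmore_gain = {1992: 1.0, 1993: 1.7, 1994: 0.7}
system_median_gain = {1992: 1.0, 1993: 1.5, 1994: 0.9}

def gain_relative_to_system(school, system):
    """School gain minus the systemwide median gain, year by year."""
    return {year: round(school[year] - system[year], 1) for year in sorted(school)}

relative = gain_relative_to_system(fillmore_gain, system_median_gain)

# Size of the 1992-to-1993 jump before and after removing the system trend.
raw_jump = round(fillmore_gain[1993] - fillmore_gain[1992], 1)
relative_jump = round(relative[1993] - relative[1992], 1)
```

Most of the apparent 0.7 GE jump in 1993 disappears once the systemwide shift is removed, which is consistent with a measurement artifact rather than a school effect.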
Figure 4a. ITBS Mathematics Content Changes: What the ITBS Tests in Grade 3

Unfortunately, many educators and most of the public are unaware of
these inherent limitations in the grade equivalent metric. These score re- 
ports are simply not designed for purposes of making inferences about change 
over time. Clearly, a better reporting metric is needed if we wish to assess 
accurately whether school productivity is improving. 

A Non-Standard Standard 

A second problem with the use of the ITBS for productivity analysis emerges 
when we consider the actual content of the tests. The skills assessed by the 
ITBS have changed over this 10-year time period. Thus, when we look at 
10-year trends in score reports, we are, in essence, judging students, schools, 
and the system against a moving target. Unfortunately, this changing target 
is largely hidden in a secure test and unknown to most educators. As a
result, a teacher may see, for example, that students in her classroom clearly
know more mathematics than previous classes of students, but their standardized
test scores may still come back lower.

Figure 4b. ITBS Mathematics Content Changes: What the ITBS Tests in Grade 6

To document this problem of changing standards, we did a content analysis
of the ITBS forms used by the CPS from 1990 through 1996. We grouped
the ITBS math test items into 12 major categories ranging at the easy end
from “money and time” and “addition” problems to the more complex tasks
of “equations,” “multiplication,” “division,” and “fractions.” 6 Figures 4a,
4b, and 4c compare the relative frequency of these 12 different item types
in the tests used from 1990 through 1992 with those used from 1993 through
1996 for grades 3, 6, and 8 respectively.

Clearly, a major content shift occurred beginning in 1993. A new topic
on “data related concepts” appeared. There was also a major increase in
“equation” problems across all grades. This, in turn, was offset by a
decline in the proportion of basic computation items involving “addition,”
“subtraction,” “multiplication,” “division,” and “number problems.”

Figure 4c. ITBS Mathematics Content Changes: What the ITBS Tests in Grade 8
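
A content comparison of this kind reduces to computing the share of items in each category for two sets of test forms and differencing the shares. The sketch below is illustrative only: the item lists are invented 20-item tests, not the actual ITBS item counts, and the function names are assumptions.

```python
from collections import Counter

def category_shares(item_categories):
    """Relative frequency of each content category among a test's items."""
    counts = Counter(item_categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def share_changes(old_items, new_items):
    """Change in each category's share from the older to the newer forms."""
    old, new = category_shares(old_items), category_shares(new_items)
    return {category: new.get(category, 0.0) - old.get(category, 0.0)
            for category in sorted(set(old) | set(new))}

# Invented item lists for an "early" and a "later" test form.
early_form = (["addition"] * 8 + ["subtraction"] * 6 +
              ["equations"] * 2 + ["fractions"] * 4)
later_form = (["addition"] * 4 + ["subtraction"] * 4 + ["equations"] * 6 +
              ["fractions"] * 4 + ["data related concepts"] * 2)

changes = share_changes(early_form, later_form)
```

A category that appears only in the later forms shows up with a positive change from a zero baseline, which is how a newly introduced topic would surface in such a tabulation.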

These patterns reflect gradually changing professional judgments about 
the appropriate content for elementary school mathematics curriculum. 
Beginning with the National Council of Teachers of Mathematics 
(NCTM) standards in 1989, there has been an emphasis on introduc- 
ing more challenging mathematics into elementary schools. Test pub- 
lishers such as Riverside, producer of the ITBS, pay close attention to 
these developments. In general, the content of national norm-referenced 
tests is deliberately designed to sample broadly from the different kinds of 
curricula that schools may be implementing in order to provide a basis for
global comparisons of how students in a particular school or district com- 
pare with a national sample of children who took the same test. The tests 
are purposefully not aligned with any one curricular strategy so as to be 
useable across a wide range of schools. As a result, they are a very blunt 
instrument for assessing increasing productivity in a particular curriculum 
because only a modest portion of the test may be assessing what schools are 
actually trying to teach students in any given grade. For example, while the 
tests used in 1993 through 1996 reflect some movement toward the NCTM 
standards, few math educators would consider these authentic tests of the 
more challenging mathematics envisioned in the NCTM standards.

We again return to our general point. The ITBS system was simply 
not designed for the purposes to which it is now directed. The testing 
system was intended to compare the competence level of an individual 
or group of students relative to a national sample who took the same 
particular test of basic knowledge and skills. For this comparison to be 
relevant, the tests try to represent at least some of what children might 
be asked to learn in a wide range of districts. Moreover, it is quite natu- 
ral to change the content of norm-referenced tests over time as ideas 
about instruction shift. This helps to keep comparisons across districts 
as relevant as possible. This latter principle, however, proves problem- 
atic when we switch purposes toward assessing changes in school pro- 
ductivity in a single district. An absolute prerequisite for valid studies of 
change is a constant measurement ruler. 

The Alternative: A Content-Referenced 
Measurement System 

The problems laid out above offer a formidable challenge to any simple 
assessment of changing productivity in the CPS. We found it necessary to 
create a new test score metric that allows us to take into account the different
content used in various ITBS forms in order to better compare results
across time. For this purpose, we undertook a major equating study of all 
forms and levels of the ITBS used in Chicago from 1987 through 1996 at 
grades 1 through 8. (See details of the equating study in the sidebar on page 
17.) This test equating produced a content-referenced scale that offers a 
common metric against which persons and schools can be assessed. The 
scale is constructed around the relative difficulty of the test items for CPS 
students. Each student is then measured against this content-referenced scale. 
Any student’s scale scores can be directly interpreted in terms of the kinds
of items that the student is likely to answer correctly and those that he or
she is unlikely to know. In this process, we are adjusting for the variations in
test content across forms and levels. A particular test is now simply a set of
items, each of which has its own unique difficulty. By knowing the difficulty
of the items a child got right and wrong, we can calculate a content-referenced
scale score.

Figure 5. Rasch Test Score Bias Due to Form and Level Differences
Note 1: The “about the same” category is +/- 1 standard deviation from zero.
Note 2: See endnote 8.
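
The scoring step can be sketched with the Rasch model named in the Figure 5 caption. The code below is a minimal illustration, not the equating procedure used in this study: it assumes item difficulties are already known in logits, estimates a student's ability by maximum likelihood with Newton-Raphson updates, and uses invented difficulties and function names.

```python
import math

def rasch_probability(theta, difficulty):
    """Rasch model: probability of a correct answer given ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_ability(responses, max_iter=50, tol=1e-6):
    """Maximum-likelihood ability estimate (in logits).

    responses: list of (item_difficulty, correct) with correct equal to 0 or 1.
    """
    raw_score = sum(correct for _, correct in responses)
    if raw_score == 0 or raw_score == len(responses):
        raise ValueError("ML estimate is undefined for all-wrong or all-right patterns")
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_probability(theta, b) for b, _ in responses]
        gradient = raw_score - sum(probs)               # observed minus expected score
        information = sum(p * (1 - p) for p in probs)   # test information at theta
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Two invented five-item forms; the second is uniformly one logit harder.
easier_form = [(-2.0, 1), (-1.0, 1), (0.0, 1), (1.0, 0), (2.0, 0)]
harder_form = [(-1.0, 1), (0.0, 1), (1.0, 1), (2.0, 0), (3.0, 0)]

theta_easier = estimate_ability(easier_form)
theta_harder = estimate_ability(harder_form)
```

Both hypothetical students answer three of five items correctly, yet the one who took the harder form receives an estimate one logit higher. That is the sense in which a difficulty-referenced score adjusts for the form and level a student happened to receive.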

The major advantage of the content-referenced metric is that the scale
scores of students of similar competence or ability should no longer depend 
on the specific form and level of the ITBS they receive. 7 In Figure 5 we 
present the same data as previously analyzed in Figure 2. Figures 5a, 5b, and 
5c show that, regardless of the specific level or form administered, a student 
is no more likely to do better than to do worse. The key difference as compared
to the GE metric is that, while some students still do better on the
first test and some do worse, on average, there is no bias. That is, a student 
has an equal chance of doing better or worse on the second administration. 
This is reflected in Figure 5 by the fact that the percentage of students 
doing better and the percentage doing worse are approximately the same in
all three panels. 

On balance, the results presented here illustrate the kinds of improve- 
ments that can occur when test scales are content-difficulty referenced. Our 
equating design involved 24 different situations, or links, where students
took two different forms and/or levels of the ITBS. The GE metric showed 
bias in half of the cases! The equating removed the bias in eight cases, ef- 
fected improvement in three situations, and exacerbated it in one case. While 
this is an improvement, it is less than ideal. 8 To establish better test compa- 
rability, the mechanism for test equating needs to be built directly into the 
design of the testing program rather than treated as a special study as we 
have done in this research. 

The ITBS as Content-Referenced Scales 

The reading and mathematics measurement rulers — Figures 6 and 7 (in- 
cluded separately) — present the content-referenced scales for the reading 
and math series that we developed from the equating study. In both cases, 
the scale has been established so that test scores run from 0 to 100. These 
content-referenced scales form a developmental metric. Higher scores indi- 
cate more advanced student competency. The scales have been anchored 
such that a score of 20 is comparable to being at national norms at the end 
of first grade, and a score of 80 is consistent with being at national norms at 
eighth grade, based on the average of 1987 to 1996 ITBS scores. 9 The Chicago
grade-level averages for 1996 are represented in the blue bars at the top
and bottom of each scale. For example, the fourth grade average reading 
scale score was 48 in 1996; for sixth grade it rose to 60.5. The comparable 
results for math at grades four and six were 48 and 65 respectively. 
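
Anchoring of this kind can be sketched as a rescaling of an underlying ability estimate. The text gives only the two anchor points (20 and 80), so the linear form of the transformation, the logit values, and the function name below are all assumptions made for illustration.

```python
def make_scale(theta_grade1_norm, theta_grade8_norm, low=20.0, high=80.0):
    """Return a function mapping ability (logits) onto the reporting scale.

    Assumes a linear transformation fixed by two anchors: the ability at the
    national norm for the end of first grade maps to `low`, and the ability at
    the national norm for eighth grade maps to `high`.
    """
    slope = (high - low) / (theta_grade8_norm - theta_grade1_norm)

    def to_scale(theta):
        return low + slope * (theta - theta_grade1_norm)

    return to_scale

# Hypothetical anchor abilities in logits.
to_scale = make_scale(theta_grade1_norm=-3.0, theta_grade8_norm=3.0)
midpoint_score = to_scale(0.0)
```

Under these invented anchors each logit corresponds to 10 scale points, so differences between scale scores keep the same meaning anywhere along the ruler.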

The scale score for any student (or the average score for an individual 
school) is directly related to the specific content that constitutes the test 
series. For example, students with scale scores of 50 on the reading test 
have a 75 percent probability of answering correctly the items clustered 
around that scale value (e.g., items C4, D1, and E2). They are even
more likely to get the simpler items (e.g., C1 and D3) correct. They are
less likely, however, to answer correctly the harder items, for example 
those associated with passage F. 
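
The 75 percent convention can be made concrete if we assume the scale rests on the Rasch model, with student ability \theta and item difficulty b both in logits (these symbols are introduced here for illustration and do not appear in the report):

```latex
P(\text{correct} \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}},
\qquad
P = 0.75 \iff e^{\theta - b} = 3 \iff \theta - b = \ln 3 \approx 1.10 .
```

Under that assumption, an item is placed on the ruler at the ability level of a student who has a 75 percent chance of answering it correctly, which lies about 1.1 logits above the item's difficulty. Items placed lower on the ruler are answered correctly with probability above 0.75, and items placed higher with probability below it.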

In short, the scale score provides specific information about what stu- 
dents know and can do. This is what we mean by a content-referenced, as 
contrasted to a solely norm-referenced, testing system. 

The reading scale. The reading scale is defined by the difficulty of the 
reading passages and the individual items associated with each passage. We 
present here a sample of tasks from Form 7, which was used by the CPS up
to 1989, to illustrate the content difficulty that forms the overall scale. In 
general, the reading tasks become more difficult as we move from left to 
right across the scale. 10 Each sample passage has been selected so as to illus-
trate what a student who is approximately on that grade level should be able
to read well. For example, passage E about fireflies represents the kind of 
text that an on-level student in grade three should be able to comprehend 
and answer questions about. 11 The difficulties of selected individual items 
for each passage are referenced against the scale at the bottom of the page. 
Notice that the items vary considerably in their difficulty even within a 
single passage. For example, item E1 associated with the fireflies passage is
relatively simple to answer and has a scale difficulty of 39; in contrast, item 
E3 is almost 20 scale points harder. 

In general, the easiest passages (i.e., with lower scale score difficulties) 
involve short simple narratives. The items associated with them tend to ask 
simple factual questions and make little or no evaluative demands on the 
student. The questions associated with fireflies offer good examples of this. 
In contrast, passages on the right draw on more specialized subject matter 
and offer a more detailed exposition of facts. These passages also tend to use 
more complex sentences with less common vocabulary. For example, pas- 
sage H is about the Floating Market in Thailand — a topic with which most 
Chicago students would not have had any firsthand experience. These up-
per level passages sometimes tap other literary genres, such as passage I,
which is a poem. Items associated with such passages typically elicit the 
reader’s overall impression (or inference) of what a passage is about in its
mood, tone, and meaning. 

The mathematics scale. The easiest items in mathematics probe students’ 
ability to count, perform simple addition, and tell time. These typically 
have item difficulties of 20 or less. Next come subtraction and multiplica- 
tion tasks which become more common around scale values in the 20s and 
30s. As we move farther up the scale, the computation tasks become more 
complex and involve other operations such as division and fractions. Word 
problems and tasks involving equations become more frequent as well. Some 
topics, such as geometry problems, span almost the entire scale, but the 
questions become more complex. For example, a simple geometry problem 
of identifying shapes has a scale difficulty of 16; in contrast, a geometry 
problem involving lines and angles has a difficulty of 82. 

The interpretation of students’ scale scores follows the same basic
logic as the reading scale. For example, the average Chicago first grader
in spring 1996 had a scale score of 22. Such students are likely to be able to
do simple two-digit addition with no regrouping and even more likely
to answer correctly simple addition and time problems. Questions that
ask simple multiplication facts (e.g., 3 x 3 = ?, which has a scale difficulty
of 31) would likely be too difficult. Similarly, the typical eighth grader
in the CPS in 1996 (scale score of 76) would likely show mastery over
most computation tasks (except for the most complex division and frac-
tion problems). But he or she would encounter difficulty with more
complex word problems (e.g., the distance, rate, and time problem il-
lustrated with a scale difficulty of 81), or with problems requiring a
solution to a linear equation system in two unknowns (scale difficulty of
88), or finding the roots of a quadratic equation (scale difficulty of 91).

Equating the ITBS

We conducted a series of four separate studies to equate the six different forms of the ITBS
used from 1987 through 1996. 12 These studies involved both vertical equating (that is, linking
different levels within the same test form, such as grades three and four tests given in 1990) as
well as horizontal equating (that is, linking similar levels in different tests, such as third-grade tests
given in 1991 and 1992). In order to accomplish horizontal equating, four studies were undertaken
in which students completed two different tests. This created the necessary links to make scores com-
parable across forms.

Within each form, test levels 9 through 14 are linked by common items that appear on more than one
test level. This provided the basis for the vertical equating among test levels. For levels 7, 8, and 9,
which share no common items within a form, the vertical links were established by groups of stu-
dents who took two of these different test levels at the same time. These groups ranged in size from
150 to 450 students. Additional data from students who took single test levels (about 1,000 students
per test level) were included to improve the precision of the item difficulty estimates. For those
students who took two tests, the order of test administration was varied. This counterbalancing
design was employed to prevent systematic effects of fatigue, boredom, and differential effort and
motivation.

The actual statistical equating relied on a method of test item calibration called Rasch analysis.
The Rasch model is a member of a class of scaling models based on item response theory (IRT),
which is currently used by most modern testing programs, such as the NAEP, the SAT, and the TOEFL. Item
difficulties for all forms and levels are placed on the same scale. This is intended to assure that all
measures are directly comparable.

To be sure, issues of comparability can still arise, as small changes in the design of a testing
program can have a significant impact on observed student performance. The intent in the techni-
cal design of the assessments, however, is to assure greater comparability than is now the case.
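
The Rasch calibration mentioned in the "Equating the ITBS" sidebar can be illustrated with a minimal sketch. It assumes the standard one-parameter logistic form of the Rasch model, with student ability and item difficulty both expressed in logits. The transformation from logits to the 0 to 100 scale used in this report is not given in this section, so it is left out. The function name `rasch_p_correct` is hypothetical.

```python
import math


def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch model.

    Both arguments are in logits (an assumption; the report's 0-100
    scale transformation is not reproduced here).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# When ability equals item difficulty, the chance of success is one half.
p_equal = rasch_p_correct(0.0, 0.0)

# A 75 percent chance of success corresponds to an ability that is
# ln(3) logits (about 1.1) above the item difficulty.
p_above = rasch_p_correct(math.log(3.0), 0.0)
```

Under this form, the further a student's ability lies above an item's difficulty, the more likely a correct answer becomes, which matches the way easier and harder items are described above. How the reported scale positions items so that a student at a given scale value has a 75 percent chance on them is not spelled out in this section.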



A Good Indicator of School Productivity: A Value-Added Approach

We showed earlier that a school mean provides a better statistical summary
of the overall attainment of students in a school or district because the
performance of every student affects the indicator value. This statistic is
most useful for informing us about the overall level of students’ capabilities.
Moreover, if we track this indicator over time, it will tell us about possible
changes in overall student attainment.

Figure 8. Average Percentage of Students Remaining in the
Same School after One through Four Years (categories shown: Stable, Mobile)

The average achievement level, however, is not an especially good indica- 
tor of school productivity and whether this is changing over time. One 
major problem that this indicator fails to take into account is student mo- 
bility. For example, if a group of students enrolls in a school sometime 
during the academic year, even on the day just before testing, their scores 
will be counted as part of the overall achievement level for the school. Clearly, 
the attainment for these students depends primarily on their previous school- 
ing experiences and home background and tells us virtually nothing about 
the effectiveness of the particular school. 

This concern is especially problematic in urban school districts such as 
Chicago because student mobility tends to be high. In the typical Chicago 
elementary school only 80 percent of the students tested in a given year 
were also tested in the same school the previous year. This means that 20 
percent of the students are new each year. 13 (See Figure 8.) Over a third of 
the students are new to schools over a two-year period. 
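
The two-year figure is roughly what compounding the one-year figure would give. The sketch below assumes, purely for illustration, that the same 80 percent retention rate applies independently in each year; Figure 8 reports the observed percentages, which are not reproduced here. The function name `share_remaining` is hypothetical.

```python
def share_remaining(annual_retention: float, years: int) -> float:
    """Share of an initial group still in the same school after `years`.

    Assumes a constant, independent annual retention rate
    (an illustrative simplification, not the report's method).
    """
    return annual_retention ** years


remaining_two_years = share_remaining(0.80, 2)   # about 0.64
new_two_years = 1.0 - remaining_two_years        # about 0.36, over a third
```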

Additional problems arise as we examine trends over time. Consider, for 
example, a school in a “port of immigration” neighborhood. Many of the 
students enrolled in the neighborhood school will not be native English 
speakers and, as a result, their measured initial standardized test scores will 
typically be low. (Further complicating the problem, the CPS currently has 
no tests designed to measure how well non-native speakers are learning 
English.) As these students progress through a few years of schooling, their
academic attainment is likely to improve, but they may also leave the school
as their families develop opportunities to move into better housing. New
immigrants in the community replace these students, and the cycle begins 
anew. Clearly, the average attainment level for such a school is not likely to 
get very high because teachers are constantly working with new students. 
While school staff may do a terrific job contributing to the learning of 
students who are enrolled, few students stay long enough to significantly 
affect the bottom line of average student attainment. 

More generally, if the student composition of a school is changing over 
time, the average achievement levels might well rise or fall, but this would 
tell us little about any possible changes in school effectiveness. Clearly, we 
need to take such factors into account in developing an appropriate indica- 
tor for purposes of assessing school productivity and whether this is chang- 
ing over time. 14 In order to do this, we begin with a basic accountability 
principle: A school should be held responsible for the learning that occurs 
among students actually taught in that school. This suggests that rather 
than focusing exclusively on the average achievement levels at each grade 
level, we also consider the gains in achievement made by students at each 
grade in the school for each year. 15 

In addition, as we examine trends in achievement gains over time, we 
need to take into account other factors that might also be changing during 
this period that could affect the observed learning trends. For example, over 
the 10-year period of this study, the CPS changed its procedures concern- 
ing eligibility requirements for the testing of bilingual students. Similarly, 
grade retention policy changed. Both of these policy changes could very 
well affect the gains recorded at some grade levels and schools. As a general 
rule, we want to adjust for the effects of such extraneous factors so that any 
changes over time in a school’s value added to learning will signal real im-
provements (or declines) in school productivity.

The Grade Productivity Profile 

With these ideas as background, we now proceed to define a productivity 
profile for each school. The school profile is composed of a set of grade 
profiles, one for each grade in the school for which entry and exit data are 
available. Figure 9 develops the idea of a grade productivity profile using 
test data from grade six at Fillmore School. 

The productivity profile is built up out of two basic pieces of informa- 
tion for each school grade: the input status for the grade and the learning 
gain recorded for that grade. The input status captures the background knowl- 
edge and skills that students bring to their next grade of instruction. To 
estimate this input status, we began by identifying the group of students
who received a full academic year of instruction in each grade in each school,
and then retrieved their ITBS test scores from the previous spring. As noted
above, students who move into and out of a school during the academic
year do not count in the productivity profile for that year. 16 For our illustra-
tive case of grade six at Fillmore School, we retrieved the end of grade five
test scores for students who spent grade six at the school. The average of
these students’ previous year’s test scores is the input status for that school
grade. This input status is what teachers had to build on to advance the
learning of the stable sixth grade students at Fillmore School that year.

Figure 9. Constructing the Grade Productivity Profile

As for the learning gain for each school grade, this is simply how
much the end of year ITBS results have improved over the input status
for this same group of students. In terms of our case example of grade
six instruction at Fillmore School, the learning gain for the stable grade
six students is how much their test scores have improved over the grade
five scores from the previous year. Finally, by adding the learning gain to
the input status we recover a third piece of information: the output
status. This tells us about the knowledge and skill levels of these stu-
dents at the end of a year of instruction. This would be at the end of
grade six in our Fillmore School example.
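
The three quantities can be sketched in a few lines. The records below are invented for illustration and are not Fillmore School data; the tuple layout and the function name `grade_profile` are assumptions. Only students who spent the full year in the grade contribute, as described above.

```python
from statistics import mean

# Hypothetical records for one grade in one year:
# (student_id, full_year_in_school, prior_spring_score, current_spring_score)
records = [
    ("s1", True, 52.0, 60.0),
    ("s2", True, 48.0, 58.0),
    ("s3", False, 40.0, 47.0),  # enrolled mid-year, so excluded
    ("s4", True, 56.0, 62.0),
]


def grade_profile(rows):
    """Return (input_status, learning_gain, output_status) for stable students."""
    stable = [r for r in rows if r[1]]
    input_status = mean(r[2] for r in stable)    # average prior-spring score
    output_status = mean(r[3] for r in stable)   # average end-of-year score
    learning_gain = output_status - input_status
    return input_status, learning_gain, output_status


input_status, learning_gain, output_status = grade_profile(records)
```

With these made-up scores the stable students enter at an average of 52, leave at 60, and so record a learning gain of 8 scale points. The mobile student's scores affect none of the three values.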

The grade productivity profile is organized around data from some base 
year. In our analyses of productivity for CPS schools we have selected 1991 
as the base year. 17 Panel 9a displays the base year input status, learning gain, 
and output status for grade six at Fillmore School. We then add to this in 
panel 9b the grade six data for the years before and after 1991. We have now
represented all of the basic data for examining academic productivity in
grade six at Fillmore School.

Our interest in changing school productivity directs attention to the varia- 
tion over time reflected in these data. A visual scan of panel 9b suggests that 
the inputs to grade six at Fillmore School may be declining over time. Coun- 
tering this, the learning gains appear to be increasing and with this, the 
outputs also appear to be increasing. To make this clearer, Panel 9c adds an
input trend and an output trend to the profile. Notice that each of these trend
lines varies considerably from year to year. This variability in the data tends
to obscure any overall pattern. To highlight the pattern, we compute smoothed
trends that involve estimating the best summary line that fits these data.
These are presented in Panel 9d. To make the trends even clearer, Panel 9e 
presents the trend lines with the basic data removed. 

Indeed, the inputs to grade six have declined, but the learning gains in- 
creased. The latter is reflected by the fact that the input and output trend 
lines spread apart over time. Moreover, since the learning gains increased 
faster than the input decline, a positive output trend is the net effect. Key to 
making such judgments is the estimation of smoothed trend lines through 
the use of a statistical model. (See the Appendix for a description of the 
model and discussion of estimation issues.) The analysis generates our most 
concise visual summary of a grade productivity profile. Panel 9f illustrates 
the final representation of this. 
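
As a rough illustration of what a smoothed trend involves, the sketch below fits an ordinary least-squares line to yearly input and output values. This is a stand-in chosen for simplicity: the statistical model actually used is described in the Appendix and is not reproduced here. The yearly values are invented, and `fit_line` is a hypothetical helper.

```python
def fit_line(years, values):
    """Ordinary least-squares line through (year, value); returns (slope, intercept)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical yearly averages for one grade (not actual school data).
years = [1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996]
inputs = [55, 53, 54, 52, 51, 52, 50, 49, 50]    # input status drifting down
outputs = [60, 61, 60, 62, 61, 63, 62, 64, 63]   # output status drifting up

input_slope, _ = fit_line(years, inputs)
output_slope, _ = fit_line(years, outputs)

# Because gain = output - input in every year, the slope of the gain trend
# is the difference between the output and input slopes.
gain_slope = output_slope - input_slope
```

In this made-up case the input line falls and the output line rises, so the two lines spread apart and the gain trend is positive, the same qualitative pattern described for grade six at Fillmore School.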

The fitting of a statistical model to smooth the trend lines also serves 
another important function. It allows us to adjust the trend estimates for 
other factors that might be changing over time besides school effectiveness. 
In seeking to develop the best possible estimates of school productivity for 
the CPS, we considered a range of factors including changes in a school’s 
ethnic composition, percentage of low income students, retention rates,
percentage of students enrolled who are old for their grade, and the propor-
tion of bilingual students. Generally, the effects associated with these fac- 
tors were not large. In addition, most CPS schools did not vary much on 
most of these factors over the 10-year period from 1987 to 1996. As a 
result, the adjusted trends were quite similar to the unadjusted estimates. 18 

Finally, we use our estimate of a school’s learning gain trend to quantify
school improvement in the form of a learning gain index (LGI). This quan-
tity assesses the relative change in student learning over the last five years as
compared to the amount of learning that occurred across the system in the 
base year, 1991. 19 
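
One way to read this description is as a percentage: the change in a school's smoothed learning gain over the period, divided by the system-wide learning gain in the base year. The sketch below encodes that reading only. The exact formula is given in a note that is not reproduced in this section, so the function `learning_gain_index`, its arguments, and the numbers are all assumptions.

```python
def learning_gain_index(gain_trend_start: float,
                        gain_trend_end: float,
                        system_base_year_gain: float) -> float:
    """Percent change in the smoothed learning gain, relative to the
    system-wide base-year gain (an assumed reading of the LGI)."""
    return 100.0 * (gain_trend_end - gain_trend_start) / system_base_year_gain


# Hypothetical values in scale points.
lgi_flat = learning_gain_index(8.0, 8.0, 10.0)   # no change in the gain trend
lgi_up = learning_gain_index(8.0, 10.0, 10.0)    # gain trend rose by 2 points
```

Under this reading, a flat gain trend gives an index of 0 percent, and a rise of 2 scale points against a system base-year gain of 10 gives 20 percent.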

Classifying Productivity Profiles 

Each grade profile involves three different trends: input, learning gain, and 
output trends. If we know any two of these, the third can be inferred di- 
rectly. Observing only one of the three, such as when we monitor an output 
trend or a gain trend separately, can be misleading. 

Much of the recent literature on school accountability emphasizes use of 
the learning gain trends for purposes of judging productivity. 20 As we began 
this study, we intended to focus exclusively on the learning gain trend or 
value-added to student learning for judging school productivity. 21 Gradu- 
ally, however, we came to conclude that while the statistical arguments for 
using the learning gain or value-added trend were sound, these arguments 
were too narrow on both educational and policy grounds. We elaborate our 
concerns through two examples. 

First, consider the grade profile in Figure 10a. Notice that the output 
trend is up substantially. However, the input trend for the grade is also 
increasing at the same rate, and the estimated learning gain trend is flat. 
(Formally, the estimated LGI is 0 percent.) Visually the input and output 
trends are parallel lines, implying no change over time in the value added to 
student learning. While most educators would consider the output trend to 
be indicative of a reform success, focusing only on the learning gain trend 
would lead us to conclude that no significant change in instruction had 
occurred in this school grade. 

Let’s think about what might actually be occurring educationally in
“School A.” The students entering each year are more advanced than the
previous year’s students (i.e., the input trend is positive). The teachers must
recognize this and each year modify their plans of instruction. Since at least 
some of the instruction will be new each year, teachers must also engage in 
continuous formative evaluation — trying to figure out what is working and 
what is not and adjusting accordingly. In the absence of such teacher activ-
ity, we might expect a profile more like Figure 10b. Here, the improving
inputs go unrecognized, teachers continue to teach as they have in the past,
and succeeding student cohorts make less progress because, increasingly,
instruction is simply a repeat of past lessons. (The learning gain trend is
actually negative here. The LGI is -18 percent.) In essence, one could argue
that Figure 10b, and not Figure 10a, is the “no change” case in that Figure
10b represents the trends that we might expect to occur if teachers are not
proactive change agents.

Figure 10. Grade Productivity Profiles
Note: LGI = Learning Gain Index, computed for 1992-1996.

Now let’s consider another case represented in Figure 10c. Both the in-
put and output trends are declining, but the input trend is declining at a
faster rate. This pattern results in a positive learning trend (LGI = 78 per-
cent) that is reflected in the distance between the two trend lines increasing
over time. While from a strict value-added perspective, this is a case of
reform success (improving learning gains over time), it would still be prob-
lematic to hold up this case as an exemplar of improved performance. At a
minimum, we would want to distinguish it from a school grade with a
productivity profile more like Figure 10d. Here, both the output trend and
learning gain trends are improving over time. This clearly is a success story!

Examples such as these have led us to conclude that we should employ a 
dual indicator comparison scheme. Specifically, we need to look simulta-
neously at both the learning gain trends and output trends to classify im-
provement efforts. Taken together, these two trends provide a detailed sum-
mary of changing school productivity over time.

From visually inspecting a large number of grade productivity profiles, 
we were able to identify nine distinct patterns among output and learning 
gain trends. These are presented in Figure 11. Each cell in this table is
based on whether the output and learning gain trends are up, flat, or down,
respectively. Some patterns, such as 1, 5, and 9, are straightforward to
interpret. These represent “Up,” “No Change,” and “Down” in academic
productivity. We describe patterns 2 and 4 as “Tending Up” since there is
some evidence of improvement in either output or learning gain trends.
Similarly, we describe patterns 6 and 8 as “Tending Down” because there is
some evidence of real decline. Patterns 3 and 7 are the hardest to interpret 
since the learning gains and output trends are going in opposite directions — 
one is improving while the other is declining. Without knowing more about 
the particulars of a school case like this, we call these “Mixed” profiles. The 
result is a 7-category scheme for describing grade productivity trends. 22 
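
The labeling rules in this paragraph can be written as a small lookup. The sketch takes the direction of each trend ("up", "flat", or "down") as given, since the thresholds used to decide those directions are not stated here. It returns only the six labels named in the paragraph; how the count reaches seven categories is not specified in this passage, so no seventh label is invented. The function name `classify_profile` is hypothetical.

```python
def classify_profile(output_trend: str, gain_trend: str) -> str:
    """Label a grade profile from the directions of its output and
    learning gain trends ("up", "flat", or "down")."""
    score = {"up": 1, "flat": 0, "down": -1}
    o, g = score[output_trend], score[gain_trend]
    if o * g == -1:
        # One trend is improving while the other is declining.
        return "Mixed"
    labels = {
        2: "Up",             # both trends up
        1: "Tending Up",     # one up, the other flat
        0: "No Change",      # both flat
        -1: "Tending Down",  # one down, the other flat
        -2: "Down",          # both down
    }
    return labels[o + g]


example = classify_profile("up", "flat")
```

Because the rules treat the two trends symmetrically, the label does not depend on which axis of Figure 11 carries the output trend and which carries the learning gain trend.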

Summarizing School Productivity 

While we compute productivity profiles for each grade, we do not recom- 
mend that an accountability system use only single grade information. Our 
statistical analyses have identified negative relationships among profiles in 
adjacent grades. That is, improving productivity at one grade tends to be 
followed by some declines at the next, and the reverse is also true. 23 As a
result, judging a school by looking at only selected grades can be mislead- 
ing. We would be better off, from a statistical perspective, to average across 
adjacent grades to develop a more stable estimate of school productivity. 

Educational concerns also push us in this same direction. The design of a 
good accountability system should promote cooperative improvement ef- 
forts among a faculty in articulating curriculum across grade levels, evaluat- 
ing improvement efforts, and tracking the progress of students through 
schooling. This too suggests aggregating adjacent grade level profiles to-
gether to focus accountability analyses on the performance of meaningful